About the project

The IODS course offers training in data wrangling and analysis in an intensive but pleasant way. DataCamp offered a fun way to learn more about R, and working on GitHub showed the great potential of version control.


RStudio Exercise 2

This week, data wrangling, regression and model validation have been central in this course.

1. Read the data and explore dimensions and summaries

data_url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"
lrn14 <- read.table(data_url, sep = ",", header = TRUE)
dim(lrn14)
## [1] 166   7
str(lrn14)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

The data consists of 166 students with the following 7 variables:

  1. Gender: F = female, M = male.
  2. Age: The age of the students in years.
  3. Attitude: Global attitude toward statistics as a mean of various variables.
  4. Deep: Parameter calculated as the mean of various variables from the deep learning questions.
  5. Stra: Parameter calculated as the mean of various variables from the strategic learning questions.
  6. Surf: Parameter calculated as the mean of various variables from the surface learning questions.
  7. Points: Exam points.

2. Summary of the variables and graphical overview

summary(lrn14)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

The group consists of 110 females and 56 males with a median age of 22 years (range 17-55). The students had a median attitude score of 3.2 (range 1.4-5.0), a median of 3.7 (range 1.6-4.9) for the deep learning approach (deep), 3.2 (range 1.3-5.0) for the strategic learning approach (stra) and 2.8 (range 1.6-4.3) for the surface learning approach (surf). The median exam score (points) was 23 (range 7-33).

pairs(lrn14[-1], col=lrn14$gender)

The graphical summary above shows the relationships between the measurements as bivariate scatter plots. The colour of the points encodes gender: females (factor level 1) are plotted as black dots and males (level 2) as red dots.

library(GGally)
library(ggplot2)
p <- ggpairs(lrn14, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

The graphic above displays the data in more ways than the bivariate scatter plot matrix. The distributions are roughly normal, except for age and deep, which are right- and left-skewed respectively. The strongest relationship in the data is between attitude toward statistics and final exam points, with a positive correlation of about 0.4 for both sexes. The second strongest is between the deep and surface learning approaches: for males this correlation is about -0.6, while for the whole group it is about -0.3.
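As a sketch of how such by-group correlations can be computed directly, the following uses simulated data (the variable names mirror lrn14, but the numbers are made up):

```r
# Sketch: correlation of two variables within each group, as shown in the
# ggpairs panels above (simulated data, not the actual lrn14 sample).
set.seed(1)
d <- data.frame(
  gender   = rep(c("F", "M"), each = 50),
  attitude = runif(100, 1, 5)
)
d$points <- 10 + 3 * d$attitude + rnorm(100, sd = 5)

# correlation of attitude and points within each gender group
by_group <- tapply(seq_len(nrow(d)), d$gender,
                   function(i) cor(d$attitude[i], d$points[i]))
round(by_group, 2)
```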

3. Regression model

model_1 <- lm(points ~ attitude + stra + deep, data = lrn14)
summary(model_1)
## 
## Call:
## lm(formula = points ~ attitude + stra + deep, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3915     3.4077   3.343  0.00103 ** 
## attitude      3.5254     0.5683   6.203 4.44e-09 ***
## stra          0.9621     0.5367   1.793  0.07489 .  
## deep         -0.7492     0.7507  -0.998  0.31974    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08

In the first model, attitude is clearly significant with a low p-value. The deep variable, which has the highest p-value, will be removed and the model fitted again.

model_2 <- lm(points ~ attitude + stra, data = lrn14)
summary(model_2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

In this model, attitude is again the only variable that significantly explains variation in exam points. The model will be run again without the stra variable.

model_3 <- lm(points ~ attitude, data = lrn14)
summary(model_3)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

4. Interpretation and explanation of model parameters

Residuals: The residuals are the differences between the values predicted by the model and the observed values. For a well-fitting model their distribution should be roughly symmetric around zero.
Coefficients: The intercept is the expected mean value of Y when all X = 0. The slope of attitude is estimated at 3.53, meaning that exam points increase on average by about 3.5 for every one-point increase in attitude. The standard error measures the uncertainty of each coefficient estimate; roughly 95 % of the time the true coefficient lies within about two standard errors of the estimate. Pr(>|t|) is the probability of observing a t-value at least as extreme as the one computed if the true coefficient were zero. A small p-value indicates that the observed relationship between the predictor (attitude) and the response (points) is unlikely to be due to chance; three asterisks mark a highly significant p-value.
Residual standard error: The residual standard error is a measure of the quality of the linear regression fit: the typical amount by which the response (points) deviates from the fitted regression line.
R-squared: The R-squared statistic measures how well the model fits the actual data. A value of 0 means the predictor explains none of the variance in the response; 1 means it explains all of it. In this case roughly 19 % of the variance in the response variable (points) can be explained by the predictor variable (attitude).
F-statistic: The F-statistic indicates whether there is a relationship between the predictors and the response. The further it is above 1, the stronger the evidence to reject H0 (null hypothesis: "There is no relationship between attitude and points.").
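The R-squared and residual standard error above can be reproduced by hand from the residuals; a minimal sketch on simulated data (the actual values depend on lrn14):

```r
# Sketch: recomputing R-squared and the residual standard error by hand
# from a fitted lm object (simulated data, not the actual lrn14 sample).
set.seed(42)
x <- runif(100, 1, 5)                    # stand-in for attitude
y <- 11 + 3.5 * x + rnorm(100, sd = 5)   # stand-in for points
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)        # residual sum of squares
tss <- sum((y - mean(y))^2)         # total sum of squares
r_squared <- 1 - rss / tss          # proportion of variance explained
rse <- sqrt(rss / fit$df.residual)  # residual standard error

round(c(r_squared = r_squared, rse = rse), 3)
```

Both quantities match what summary() reports for the same model.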

5. Diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage

plot(model_3, which = c(1:2, 5))

Residuals vs Fitted values: The residuals scatter fairly evenly around zero with no clear pattern, supporting the assumptions of a linear relationship between the predictor and the outcome and of constant error variance. A few outliers can be observed.
Normal Q-Q: The relationship between the theoretical and standardized quantiles mainly follows a straight line, supporting the assumption of normally distributed errors. Some deviation is observed in the lower left corner.
Residuals vs Leverage: The outliers identified in the residuals vs fitted values plot do not have high leverage on the fit, so the model can be used to describe the relationship between 'attitude' and 'points' and the outliers can be kept. If they were due to registration errors in the data and had high leverage on the model, they could be removed.
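Leverage and influence can also be inspected numerically; a sketch using the base R diagnostics behind the Residuals vs Leverage plot, on simulated data with one deliberately extreme observation:

```r
# Sketch: extracting leverage (hat values) and Cook's distances,
# the quantities behind the Residuals vs Leverage plot (simulated data).
set.seed(7)
x <- c(runif(50, 1, 5), 15)        # observation 51 has an extreme x value
y <- 2 + 1.5 * x + rnorm(51)
fit <- lm(y ~ x)

lev  <- hatvalues(fit)             # leverage of each observation
cook <- cooks.distance(fit)        # overall influence on the fit

which.max(lev)                     # the extreme point has the highest leverage
```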


RStudio Exercise 3

This week, data wrangling, logistic regression, odds ratios and prediction have been central in this course.

1. Create this file

2. Read the joined student alcohol consumption data

data_url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc <- read.table(data_url, sep = ",", header = TRUE)
dim(alc)
## [1] 382  35
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The data consists of 382 observations with the following 35 variables:

  1. school: student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
  2. sex: student’s sex (binary: ‘F’ - female or ‘M’ - male)
  3. age: student’s age (numeric: from 15 to 22)
  4. address: student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
  5. famsize: family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
  6. Pstatus: parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
  7. Medu: mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
  8. Fedu: father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
  9. Mjob: mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
  10. Fjob: father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
  11. reason: reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
  12. guardian: student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
  13. traveltime: home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
  14. studytime: weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
  15. failures: number of past class failures (numeric: n if 1<=n<3, else 4)
  16. schoolsup: extra educational support (binary: yes or no)
  17. famsup: family educational support (binary: yes or no)
  18. paid: extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
  19. activities: extra-curricular activities (binary: yes or no)
  20. nursery: attended nursery school (binary: yes or no)
  21. higher: wants to take higher education (binary: yes or no)
  22. internet: Internet access at home (binary: yes or no)
  23. romantic: with a romantic relationship (binary: yes or no)
  24. famrel: quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
  25. freetime: free time after school (numeric: from 1 - very low to 5 - very high)
  26. goout: going out with friends (numeric: from 1 - very low to 5 - very high)
  27. Dalc: workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
  28. Walc: weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
  29. health: current health status (numeric: from 1 - very bad to 5 - very good)
  30. absences: number of school absences (numeric: from 0 to 93)
    The following grades are related with the course subject, Math or Portuguese:
  31. G1: first period grade (numeric: from 0 to 20)
  32. G2: second period grade (numeric: from 0 to 20)
  33. G3: final grade (numeric: from 0 to 20, output target)
  34. alc_use: average of ‘Dalc’ and ‘Walc’
  35. high_use: TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise

3. Four interesting variables in the data and for each of them, present your personal hypothesis

I would choose sex, quality of family relationships, going out with friends and willingness to pursue higher education as variables for further investigation. I would expect male students who go out a lot to use more alcohol, while willingness to pursue higher education and a good family relationship could intuitively be associated with low alcohol consumption.

4. Numerical and graphical exploration of the distributions of the variables above and their relationships with alcohol consumption (use for example cross-tabulations, bar plots and box plots)

library(tidyr); library(dplyr); library(ggplot2)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
ggplot(alc, aes(x = high_use, y = age, col = sex)) + geom_boxplot() + ylab("age") + ggtitle("Student age by high alcohol use and sex")

ggplot(alc, aes(x = high_use, y = goout, col = sex)) + geom_boxplot() + ylab("Going out") + ggtitle("Going out with friends by high alcohol use and sex")

ggplot(data = alc, aes(x =goout, fill = high_use)) + geom_bar() + xlab("Going out") + ggtitle("Going out with friends and high alcohol use distribution")

ggplot(alc, aes(x = high_use, y = famrel, col = sex)) + geom_boxplot() + ylab("Family relationship") + ggtitle("Family relationship by high alcohol use and sex")

ggplot(data = alc, aes(x = famrel, fill = high_use)) + geom_bar() + xlab("Family relationship") + ggtitle("Family relationship and high alcohol use distribution")

table(alc$high_use, alc$sex)
##        
##           F   M
##   FALSE 157 113
##   TRUE   41  71
alc %>% group_by(famsup) %>% summarise(count = n(), mean_alc_use = mean(alc_use))
## # A tibble: 2 x 3
##   famsup count mean_alc_use
##   <fctr> <int>        <dbl>
## 1     no   144     1.954861
## 2    yes   238     1.829832
alc %>% group_by(alc_use) %>% summarise(count = n(), mean_goout = mean(goout))
## # A tibble: 9 x 3
##   alc_use count mean_goout
##     <dbl> <int>      <dbl>
## 1     1.0   144   2.736111
## 2     1.5    68   2.882353
## 3     2.0    58   3.120690
## 4     2.5    42   3.309524
## 5     3.0    32   4.000000
## 6     3.5    17   3.764706
## 7     4.0     9   4.222222
## 8     4.5     3   3.666667
## 9     5.0     9   4.222222

Student age by high alcohol use and sex: in the boxplot, the male high users appear to be somewhat older; numerically, however, the mean ages for the whole group do not show a clear relationship.
Going out with friends by high alcohol use and sex: There seems to be a relationship between going out frequently with friends and high alcohol use.
Family relationship by high alcohol use and sex: A good family relationship seems to protect against high use of alcohol.

library(tidyr); library(dplyr)
table(alc$higher, alc$high_use) %>% prop.table %>% addmargins()*100
##      
##            FALSE       TRUE        Sum
##   no    2.356021   2.356021   4.712042
##   yes  68.324607  26.963351  95.287958
##   Sum  70.680628  29.319372 100.000000

In the cross-tabulation above between willingness to pursue higher education ("no", "yes") and high alcohol use ("FALSE", "TRUE"), the numbers are expressed as percentages of the whole study material. Students with high alcohol use pursue higher education in a smaller proportion than students who drink less.

5. Explore the relationship between your chosen variables and the binary high/low alcohol consumption variable as the target variable

library(tidyr)
alc$famrel <- factor(alc$famrel)
alc$goout <- factor(alc$goout)
m <- glm(high_use ~ alc$famrel + alc$goout + higher + sex, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ alc$famrel + alc$goout + higher + sex, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5569  -0.7303  -0.4874   0.7327   2.4733  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.9444     1.2071  -1.611 0.107226    
## alc$famrel2   0.3801     1.0768   0.353 0.724127    
## alc$famrel3   0.4501     0.9709   0.464 0.642952    
## alc$famrel4  -0.3165     0.9489  -0.334 0.738730    
## alc$famrel5  -0.9362     0.9710  -0.964 0.334934    
## alc$goout2    0.2091     0.6940   0.301 0.763227    
## alc$goout3    0.5296     0.6768   0.782 0.433976    
## alc$goout4    2.0342     0.6753   3.012 0.002594 ** 
## alc$goout5    2.4735     0.6968   3.550 0.000385 ***
## higheryes    -0.3391     0.5666  -0.598 0.549523    
## sexM          0.9851     0.2635   3.739 0.000185 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 378.16  on 371  degrees of freedom
## AIC: 400.16
## 
## Number of Fisher Scoring iterations: 4

According to the logistic regression model, males are significantly more prone to high alcohol use, and going out frequently also correlates significantly with it. Family relationship quality and willingness to pursue higher education do not correlate significantly with alcohol consumption in this model.

OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                     OR      2.5 %    97.5 %
## (Intercept)  0.1430801 0.01029021  1.270211
## alc$famrel2  1.4623806 0.20016974 15.231512
## alc$famrel3  1.5684279 0.27441876 13.984373
## alc$famrel4  0.7286997 0.13357043  6.261852
## alc$famrel5  0.3921018 0.06771719  3.457162
## alc$goout2   1.2325383 0.35085792  5.807814
## alc$goout3   1.6982026 0.50605996  7.820783
## alc$goout4   7.6459609 2.29792372 35.198633
## alc$goout5  11.8638973 3.39834985 56.396606
## higheryes    0.7124103 0.23173686  2.193897
## sexM         2.6782042 1.60825469  4.529032

The odds ratios show about 2.7 times higher odds of high alcohol consumption for males. Going out frequently (4) or very frequently (5) multiplies the odds of high consumption by about 7.6 and 11.9 respectively, compared with the lowest level.
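An odds ratio is simply the exponentiated logistic-regression coefficient; a minimal sketch on simulated data (the group probabilities are made up):

```r
# Sketch: odds ratio as the exponentiated coefficient of a logistic
# regression (simulated data; the actual values come from the model above).
set.seed(3)
sex  <- rep(c("F", "M"), each = 200)
p    <- ifelse(sex == "M", 0.45, 0.25)  # males given a higher true probability
high <- rbinom(400, 1, p)

m_sketch <- glm(high ~ sex, family = "binomial")
exp(coef(m_sketch)["sexM"])   # odds ratio for males vs females, here > 1
```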

Conclusion:
Comparing the logistic regression model with the relationships expected from the box and bar plots: family relationship quality and pursuing higher education do not correlate significantly with alcohol consumption in this model, while male sex and going out frequently are significant risk factors for high alcohol consumption.

probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()

table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65445026 0.05235602 0.70680628
##    TRUE  0.14921466 0.14397906 0.29319372
##    Sum   0.80366492 0.19633508 1.00000000

The prediction missed 14.9 % of the observations that were true high-use cases and falsely flagged 5.2 % as high_use that were not.
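The usual summary rates follow directly from a confusion matrix like the one above; a sketch using counts reconstructed from the printed proportions (n = 382):

```r
# Sketch: accuracy, sensitivity and specificity from the 2x2 prediction
# table above (counts reconstructed from the printed proportions, n = 382).
tab <- matrix(c(250, 20,    # true FALSE: predicted FALSE / TRUE
                 57, 55),   # true TRUE:  predicted FALSE / TRUE
              nrow = 2, byrow = TRUE,
              dimnames = list(high_use   = c("FALSE", "TRUE"),
                              prediction = c("FALSE", "TRUE")))

accuracy    <- sum(diag(tab)) / sum(tab)            # proportion correct
sensitivity <- tab["TRUE", "TRUE"] / sum(tab["TRUE", ])
specificity <- tab["FALSE", "FALSE"] / sum(tab["FALSE", ])
round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
```

The accuracy is one minus the training error reported by the loss function below.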

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2015707

The logistic regression model misclassifies about 20 % of the observations (the training error).

7. Bonus 10-fold cross-validation

library(boot)
m <- glm(high_use ~ famrel + goout + higher + sex, data = alc, family = "binomial")
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta
## [1] 0.2041885 0.1995217

The training error of about 20.2 % is close to the 10-fold cross-validation prediction error of about 20.4 % shown above, so the model generalizes reasonably well to unseen data.


RStudio Exercise 4

This week, clustering and classification are central in this course.

1-2. Create this file and describe the data

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

In the “Boston”-data set there are 506 observations with the following 14 variables:

  1. crim: per capita crime rate by town.
  2. zn: proportion of residential land zoned for lots over 25,000 sq.ft.
  3. indus: proportion of non-retail business acres per town.
  4. chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  5. nox: nitrogen oxides concentration (parts per 10 million).
  6. rm: average number of rooms per dwelling.
  7. age: proportion of owner-occupied units built prior to 1940.
  8. dis: weighted mean of distances to five Boston employment centers.
  9. rad: index of accessibility to radial highways.
  10. tax: full-value property-tax rate per $10,000.
  11. ptratio: pupil-teacher ratio by town.
  12. black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  13. lstat: lower status of the population (percent).
  14. medv: median value of owner-occupied homes in $1000s.

3. Graphical overview and summary

pairs(Boston)

This plot is not very helpful for interpreting the data, but it reveals some correlative relationships. Clearer statements about correlations can be made from the correlation plot and table below.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The summary shows clearly skewed, non-normal distributions for crim, zn, rad, age, tax, indus and black, while the other variables look closer to normally distributed. In case of doubt a test for normality could be performed.
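Such a check could, for example, use the Shapiro-Wilk test from base R; a sketch on simulated data (not the actual Boston columns):

```r
# Sketch: Shapiro-Wilk test for normality, as suggested above
# (simulated data rather than the Boston variables).
set.seed(10)
normal_x <- rnorm(200)         # symmetric, roughly normal sample
skewed_x <- rexp(200)          # right-skewed sample

shapiro.test(normal_x)$p.value # typically above the 0.05 convention
shapiro.test(skewed_x)$p.value # far below 0.05: normality rejected
```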

library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------------------------ tidyverse 1.2.1 --
## v tibble  1.3.4     v purrr   0.2.4
## v readr   1.1.1     v stringr 1.2.0
## v tibble  1.3.4     v forcats 0.2.0
## -- Conflicts --------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x MASS::select()  masks dplyr::select()
library(corrplot)
## corrplot 0.84 loaded
cor_matrix<-cor(Boston) %>% round(digits = 2)
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
corrplot(cor_matrix, method="circle", type = "upper", cl.pos= "b", tl.pos = "d", tl.cex = 0.6)

There are negative correlations between dis and age, nox, indus and tax, suggesting that farther from the employment centres the housing stock is newer, there is less nitrogen oxide pollution, the proportion of non-retail business acres is lower and the property-tax rate is lower. The correlation between medv and lstat is markedly negative, meaning that areas where the population has a lower socioeconomic status have cheaper owner-occupied homes and vice versa.

4. Standardized data set and division into training and test sets

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
boston_scaled <- as.data.frame(boston_scaled)

The data is scaled, i.e. each variable is centred to mean 0 and standard deviation 1, to remove the effect of the variables' different scales on the following analyses. The data is then divided into a training and a test set.
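scale() subtracts each column mean and divides by the column standard deviation; a minimal check of that equivalence on a simulated vector:

```r
# Sketch: what scale() does column-wise, verified by hand on one vector
# (simulated data; the same operation is applied to every Boston column).
set.seed(5)
x <- rnorm(100, mean = 50, sd = 10)

scaled_auto <- as.vector(scale(x))
scaled_hand <- (x - mean(x)) / sd(x)

all.equal(scaled_auto, scaled_hand)   # TRUE
```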

# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins,  labels = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset 
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set 
test <- boston_scaled[-ind,]

5. Linear discriminant analysis

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2524752 0.2400990 0.2500000 0.2574257 
## 
## Group means:
##                   zn      indus        chas        nox          rm
## low       0.95103552 -0.8976578 -0.11793298 -0.8625810  0.43623027
## med_low  -0.06244711 -0.3059076 -0.02879709 -0.6141849 -0.10369971
## med_high -0.41761791  0.3052052  0.15646403  0.4240387 -0.02236438
## high     -0.48724019  1.0170690 -0.04518867  1.0596979 -0.36101257
##                 age        dis        rad        tax    ptratio
## low      -0.8548529  0.8421226 -0.6981323 -0.7554739 -0.4848396
## med_low  -0.4368767  0.4243095 -0.5568200 -0.4940026 -0.0666037
## med_high  0.4035951 -0.3766379 -0.4144603 -0.2618898 -0.1747415
## high      0.8237491 -0.8544356  1.6386213  1.5144083  0.7813507
##                black       lstat        medv
## low       0.37520299 -0.74718728  0.52611803
## med_low   0.32725940 -0.19185853  0.03185425
## med_high  0.04238717  0.08230273  0.05661119
## high     -0.62155418  0.88927932 -0.67685588
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.07990962  0.782572112 -1.05512897
## indus    0.02378110 -0.328247940  0.30406150
## chas    -0.07334589 -0.051332646  0.14151476
## nox      0.30883536 -0.699160371 -1.29919109
## rm      -0.08238630 -0.009264495 -0.14411116
## age      0.23026538 -0.252897529 -0.20772962
## dis     -0.06691105 -0.419651299  0.38418220
## rad      3.10545145  1.089529449 -0.03475214
## tax      0.10637545 -0.192365103  0.63038432
## ptratio  0.08119324 -0.047351452 -0.26054440
## black   -0.09575810  0.055555256  0.12131688
## lstat    0.22387033 -0.217452657  0.30025984
## medv     0.14884453 -0.408523426 -0.21863940
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9490 0.0375 0.0135
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit , dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

In this example the first linear discriminant (LD1) accounts for 94.9 % of the between-group variance, with the index of accessibility to radial highways (rad) as the most influential linear separator.
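The "proportion of trace" printed above can be recovered from the singular values stored in the fitted object. A sketch using `MASS::lda` on the built-in `iris` data as a stand-in (the `train` set above depends on a random split, so its exact numbers are not reproducible here):

```r
library(MASS)

# fit an LDA on a reproducible built-in dataset
fit <- lda(Species ~ ., data = iris)

# proportion of trace = squared singular values, normalised to sum to 1
prop <- fit$svd^2 / sum(fit$svd^2)
round(prop, 4)   # LD1 dominates, just as the rad-driven LD1 does above
```

The same computation applied to `lda.fit$svd` reproduces the 0.9490/0.0375/0.0135 split reported for the crime model.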

6. Predict LDA model on test data

# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       13      12        0    0
##   med_low    3      16       10    0
##   med_high   0       7       17    1
##   high       0       0        0   23

The model predicts the true categorical crime rates well. Most of the misclassifications were assigned to the med_low crime group.
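How well is "well"? The overall accuracy can be read off the diagonal of the cross-tabulation. The counts below are copied from the table printed above (the split is random, so a rerun would give slightly different numbers):

```r
# confusion matrix, rows = correct class, columns = predicted class
tab <- matrix(c(13, 12,  0,  0,
                 3, 16, 10,  0,
                 0,  7, 17,  1,
                 0,  0,  0, 23),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))

# accuracy = correctly classified / total test observations
accuracy <- sum(diag(tab)) / sum(tab)
round(accuracy, 3)   # 0.676: 69 of 102 test rows classified correctly
```

Note that the extreme classes are easiest: high is predicted perfectly, while most errors occur between adjacent classes.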

7. k-means algorithm

#reload Boston and scale
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
# scaled distance matrix
dist_scaled <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_scaled)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
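`dist()` returns only the lower triangle of the pairwise Euclidean distance matrix, i.e. n*(n-1)/2 values for n rows, which is what the summary above describes. A tiny check on two hand-picked points:

```r
# two points, (0,0) and (3,4): Euclidean distance should be 5
m <- matrix(c(0, 0,
              3, 4), nrow = 2, byrow = TRUE)
d <- dist(m)
as.numeric(d)   # 5

# for n rows, dist() holds n*(n-1)/2 values
length(dist(matrix(rnorm(20), nrow = 10)))   # 45
```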
# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(dist_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <- kmeans(dist_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

According to the elbow in the total within-cluster sum of squares, the optimal number of clusters is 2. In the pairs plot above, the panels with strongly correlated variables show a clear division into the two k-means clusters for most variables.
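A side note: `kmeans()` is normally applied to the (scaled) data matrix itself rather than to a distance matrix, since the algorithm works on coordinates, not on precomputed distances. A sketch on the built-in `iris` measurements as a stand-in dataset:

```r
set.seed(123)
d  <- scale(iris[, 1:4])                   # standardise, as with Boston above
km <- kmeans(d, centers = 2, nstart = 10)  # nstart guards against poor random starts
table(km$cluster)                          # sizes of the two clusters

# adding clusters always lowers the total within-cluster sum of squares;
# the elbow plot looks for the k where this drop levels off
km$tot.withinss < kmeans(d, centers = 1)$tot.withinss
```

Running `kmeans` on `dist_scaled`, as above, effectively clusters rows of the distance matrix; it produces a similar grouping here but is not the standard usage.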

K-means with 4 clusters and LDA

km <- kmeans(dist_scaled, centers = 4)
lda.fit <- lda(km$cluster~., data = boston_scaled)
lda.fit
## Call:
## lda(km$cluster ~ ., data = boston_scaled)
## 
## Prior probabilities of groups:
##         1         2         3         4 
## 0.2272727 0.1304348 0.4308300 0.2114625 
## 
## Group means:
##         crim         zn      indus       chas        nox         rm
## 1  0.2797949 -0.4872402  1.1892663 -0.2723291  0.8998296 -0.2770011
## 2  1.4330759 -0.4872402  1.0689719  0.4435073  1.3439101 -0.7461469
## 3 -0.3894453 -0.2173896 -0.5212959 -0.2723291 -0.5203495 -0.1157814
## 4 -0.3912182  1.2671159 -0.8754697  0.5739635 -0.7359091  0.9938426
##          age        dis        rad        tax     ptratio       black
## 1  0.7716696 -0.7723199  0.9006160  1.0311612  0.60093343 -0.01717546
## 2  0.8575386 -0.9620552  1.2941816  1.2970210  0.42015742 -1.65562038
## 3 -0.3256000  0.3182404 -0.5741127 -0.6240070  0.02986213  0.34248644
## 4 -0.6949417  0.7751031 -0.5965444 -0.6369476 -0.96586616  0.34190729
##        lstat        medv
## 1  0.6116223 -0.54636549
## 2  1.1930953 -0.81904111
## 3 -0.2813666 -0.01314324
## 4 -0.8200275  1.11919598
## 
## Coefficients of linear discriminants:
##                 LD1        LD2         LD3
## crim    -0.18113078 -0.5012256 -0.60535205
## zn      -0.43297497 -1.0486194  0.67406151
## indus   -1.37753200  0.3016928  1.07034034
## chas     0.04307937 -0.7598229 -0.22448239
## nox     -1.04674638 -0.3861005 -0.33268952
## rm       0.14912869 -0.1510367  0.67942589
## age      0.09897424  0.0523110  0.26285587
## dis     -0.13139210 -0.1593367 -0.03487882
## rad     -0.65824136  0.5189795  0.48145070
## tax     -0.28903561 -0.5773959  0.10350513
## ptratio -0.22236843  0.1668597 -0.09181715
## black    0.42730704  0.5843973  0.89869354
## lstat   -0.24320629 -0.6197780 -0.01119242
## medv    -0.21961575 -0.9485829 -0.17065360
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.7596 0.1768 0.0636
# colour points by their k-means cluster (the grouping this LDA was fitted on)
plot(lda.fit, dimen = 2, col = km$cluster, pch = km$cluster)
lda.arrows(lda.fit, myscale = 3)

In this example LD1 accounts for only 76 % of the between-group variance, with rad, nox, indus and zn as the most influential linear separators. The k-means algorithm with 4 centers is not a better model for clustering this data.

Plotly

model_predictors <- dplyr::select(train, -crime)
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
#k-means
dist_train <- dist(train)
## Warning in dist(train): NAs introduced by coercion
km <- kmeans(dist_train, centers = 2)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
#Matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
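The dimension check above is what makes the multiplication legal: a 404 x 13 predictor matrix times the 13 x 3 scaling matrix yields one (LD1, LD2, LD3) coordinate triple per observation. A sketch with toy sizes standing in for 404 and 13:

```r
# (n x p) %*% (p x d) -> (n x d): project rows onto the discriminant axes
set.seed(42)
X <- matrix(rnorm(5 * 13), nrow = 5)    # stand-in for the 404 x 13 predictors
S <- matrix(rnorm(13 * 3), nrow = 13)   # stand-in for lda.fit$scaling (13 x 3)
P <- X %*% S
dim(P)                                  # 5 3: one LD-coordinate triple per row
```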
library("plotly")
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
#Create a 3D plot of the columns of the matrix product
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train$crime)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=km$cluster)

We can see that the high and med_high crime groups from the LDA-trained set overlap with one of the two clusters produced by the k-means algorithm. The other k-means cluster overlaps with the low and med_low crime groups from the LDA-trained set.